2 research outputs found

    Cracking KD-Tree: The first multidimensional adaptive indexing

    Workload-aware physical data access structures are crucial for achieving short response times in (exploratory) data analysis tasks, as commonly required for Big Data and Data Science applications. Recently proposed techniques such as automatic index advisers (for a priori known static workloads) and query-driven adaptive incremental indexing (for a priori unknown dynamic workloads) form the state of the art for building single-dimensional indexes for single-attribute query predicates. However, similar techniques for more demanding multi-attribute query predicates, which are vital for any data analysis task, have not yet been proposed. In this paper, we present our ongoing work on a new set of workload-adaptive indexing techniques that focus on creating multidimensional indexes. We present our proof of concept, the Cracking KD-Tree, an adaptive indexing approach that generates a KD-Tree based on multidimensional range query predicates. It works by incrementally creating partial multidimensional indexes as a by-product of query processing. The indexes are produced only on those parts of the data that are accessed, and their creation cost is effectively distributed across a stream of queries. Experimental results show that the Cracking KD-Tree is three times faster than creating a full KD-Tree, one order of magnitude faster than executing full scans, and two orders of magnitude faster than using uni-dimensional full or adaptive indexes on multiple columns.
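The idea of building a partial KD-Tree as a by-product of range queries can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the algorithm's details, so the class names (`Node`, `CrackingKDTree`), the `min_piece` threshold, the half-open range semantics, and the choice to use query bounds as pivots in dimension order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Node:
    """A piece of the point array; a leaf until a query cracks it."""
    lo: int                          # piece covers points[lo:hi]
    hi: int
    dim: int = -1                    # split dimension (unused for leaves)
    pivot: float = 0.0               # left child holds p[dim] < pivot
    left: Optional["Node"] = None
    right: Optional["Node"] = None


class CrackingKDTree:
    """Hypothetical sketch: a KD-Tree refined only where queries touch data."""

    def __init__(self, points: Sequence[Point], min_piece: int = 4) -> None:
        self.points: List[Point] = list(points)
        self.min_piece = min_piece   # assumed threshold: smaller pieces are only scanned
        self.root = Node(0, len(self.points))

    def _crack(self, node: Node, dim: int, pivot: float) -> bool:
        """Partition a leaf's piece around pivot; return False if nothing is divided."""
        piece = self.points[node.lo:node.hi]
        smaller = [p for p in piece if p[dim] < pivot]
        larger = [p for p in piece if p[dim] >= pivot]
        if not smaller or not larger:
            return False
        self.points[node.lo:node.hi] = smaller + larger
        mid = node.lo + len(smaller)
        node.dim, node.pivot = dim, pivot
        node.left, node.right = Node(node.lo, mid), Node(mid, node.hi)
        return True

    def _try_crack(self, node: Node, low: Point, high: Point) -> None:
        """Use the first query bound that actually divides this leaf as its pivot."""
        for d in range(len(low)):
            for bound in (low[d], high[d]):
                if self._crack(node, d, bound):
                    return

    def _visit(self, node: Node, low: Point, high: Point, out: List[Point]) -> None:
        if node.left is None and node.hi - node.lo > self.min_piece:
            self._try_crack(node, low, high)
        if node.left is None:
            # Still a leaf: scan the piece and filter on all dimensions.
            for p in self.points[node.lo:node.hi]:
                if all(l <= x < h for l, x, h in zip(low, p, high)):
                    out.append(p)
            return
        # Inner node: descend only into children that overlap the query.
        if low[node.dim] < node.pivot:
            self._visit(node.left, low, high, out)
        if high[node.dim] > node.pivot:
            self._visit(node.right, low, high, out)

    def query(self, low: Point, high: Point) -> List[Point]:
        """Return points with low[d] <= p[d] < high[d], refining the index on the way."""
        out: List[Point] = []
        self._visit(self.root, low, high, out)
        return out
```

In this sketch, each query pays only for partitioning the pieces it overlaps, so regions of the data that are never queried remain unindexed, and repeated queries over the same region find progressively smaller pieces to scan.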

    Multidimensional adaptive & progressive indexes

    Exploratory data analysis is the primary technique used by data scientists to extract knowledge from new data sets. This type of workload is composed of trial-and-error hypothesis-driven queries with a human in the loop. To keep up with the data scientist's productivity, the system must be capable of answering queries in interactive times. Given that these queries are highly selective multidimensional queries, multidimensional indexes are necessary to ensure low latency. However, creating the appropriate indexes is not a given due to the highly exploratory and interactive nature of such human-in-the-loop scenarios. In this paper, we identify four main objectives that are desirable for exploratory data analysis workloads: (1) low overhead over the initial queries, (2) low query variance (i.e., high robustness), (3) predictable index convergence, and (4) low total workload time. Given that not all of them can be achieved at the same time, we present three novel incremental multidimensional indexing techniques that represent three sample points on a Pareto front for this multi-objective optimization problem. (a) The Adaptive KD-Tree is designed to achieve the lowest total workload time at the expense of a higher indexing penalty for the initial queries, lack of robustness, and unpredictable convergence. (b) The Progressive KD-Tree has predictable convergence and a user-defined indexing cost for the initial queries. However, total workload time can be higher than with Adaptive KD-Trees, and per-query time still varies. (c) The Greedy Progressive KD-Tree aims at full robustness at the expense of only improving the per-query cost after full index convergence. Our extensive experimental evaluation using both synthetic and real-life data sets and workloads shows that (a) the Adaptive KD-Tree reduc